Week 3
Old Dominion University
Forecast Variance
Trend Models
\[ GDP_t = \alpha + \beta_1i_t + \beta_2u_t + \epsilon_t\] \[ \hat{GDP}_t = \hat{\alpha} + \hat{\beta}_1i_t + \hat{\beta}_2u_t, \ \ \text{for} \ t \in \{0, ..., T\}\]
Suppose you estimate \(\hat{\alpha} = 10\), \(\hat{\beta}_1 = 4\), and \(\hat{\beta}_2 = -2\). This fitted model can then be used to forecast for \(t \in \{T+1, ..., T + h\}\):
\[\hat{GDP}_t = 10 + 4i_t - 2u_t\]
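For instance, with hypothetical future values of the interest rate and unemployment rate (the numbers below are made up purely for illustration), the fitted equation can be evaluated in R:
# Hypothetical regressor paths for t = T+1, ..., T+3 (assumed values, not data)
i_future <- c(2.0, 2.1, 2.2)    # interest rate
u_future <- c(5.0, 4.9, 4.8)    # unemployment rate
gdp_hat <- 10 + 4 * i_future - 2 * u_future   # fitted equation evaluated at the future values
gdp_hat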
We are almost always after \(\mu\), the population average.
However, we almost always need to settle for \(\overline{x}\):
\[ \overline{x} = \frac{1}{n}\sum_{i=1}^{n}x_i \]
Of course, the population of \(x_i\)’s has some variance \(\sigma^2\). Again, we usually estimate this as \(s^2\).
Then, \(\overline{x}\) has a standard deviation (standard error): \(\sqrt{\frac{s^2}{n}}\)
Note: \(\overline{x} - \mu\) has the same variance as \(\overline{x}\)
If \(b_0\), the sample average, is used as the forecast for \(y_{T+h}\), then the prediction error is the sum of the forecast error and the estimation uncertainty:
\[y_{T+h} - b_0 = e_{T+h} + \beta_0 - b_0\]
Then, because the future error \(e_{T+h}\) is independent of the estimation sample, the variance of the prediction error decomposes into the forecast-error variance plus the estimation variance:
\[ var(y_{T+h} - b_0) = var(e_{T+h}) + var(\beta_0 - b_0) = \sigma^2 + \frac{\sigma^2}{T} = \left(1 + \frac{1}{T}\right)\sigma^2 \]
\[ S_{T+h} = \sqrt{(1 + \frac{1}{T})\sigma^2} \]
Note that this is larger than the standard error of the estimated mean, \(\sqrt{\frac{\sigma^2}{T}}\): the prediction interval must also account for the variance of the new observation, not just estimation uncertainty.
When we construct confidence intervals for a regular mean, we use the following formula:
\[\overline{x} \pm z\sqrt{(\frac{1}{n})s^2} \]
However, for predictions:
\[b_0 \pm z\sqrt{\left(1 + \frac{1}{T}\right)s^2} \]
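A minimal sketch of the two intervals for a mean-only forecast, using a simulated series (the data and the 90% level are assumptions for illustration):
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)          # simulated series standing in for the data
n <- length(x); xbar <- mean(x); s2 <- var(x)
z <- qnorm(0.95)                            # two-sided 90% interval
xbar + c(-1, 1) * z * sqrt(s2 / n)          # confidence interval for the mean
xbar + c(-1, 1) * z * sqrt((1 + 1/n) * s2)  # prediction interval for a new observation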
Read in Personal Consumption Expenditure data.
Run a regression with only a constant:
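A sketch of these two steps, mirroring the call shown in the output below (the readRDS path is an assumption, following the ../data/ pattern used later in the deck):
library("lubridate")                            # for ymd(), used throughout
pce <- readRDS("../data/pce.RDS")               # assumed file name/path
reg <- lm(pc ~ 1, data = pce[!pce$horizon, ])   # constant-only regression on the estimation sample
summary(reg)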
Call:
lm(formula = pc ~ 1, data = pce[!pce$horizon, ])
Residuals:
Min 1Q Median 3Q Max
-14.8748 -1.8748 -0.0248 1.9252 18.8252
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.3748 0.2102 16.06 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.427 on 265 degrees of freedom
Confidence Interval:
fit lwr upr
3.374812 3.027930 3.721695
Prediction Interval:
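Here pred1 holds the in-sample prediction interval; a definition consistent with how it is plotted below (fit, lower, and upper columns) would be:
# 90% in-sample prediction interval from the constant-only model
# (level assumed to match the out-of-sample forecast below)
pred1 <- predict(reg, interval = "prediction", level = .9)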
plot(pce$date[!pce$horizon],
pce$pc[!pce$horizon], type = "l",
ylab = "Personal Consumption Expenditure",
xlab = "Quarter")
lines(pce$date[!pce$horizon], pred1[,1], col = "tomato")
lines(pce$date[!pce$horizon], pred1[,2], col = "dodgerblue", lty = 2)
lines(pce$date[!pce$horizon], pred1[,3], col = "dodgerblue", lty = 2)
pred_horizon <- predict(reg, newdata = pce[pce$horizon,],
interval = "prediction",
level = .9)
lim <- !pce$horizon & pce$date >= ymd("2000-01-01")
plot(pce$date[lim],
pce$pc[lim], type = "l",
ylim = c(min(pce$pc),
max(pce$pc)),
xlim = ymd(c("2000-01-01", "2016-01-01")),
ylab = "Personal Consumption Expenditure",
xlab = "Quarter")
lines(pce$date[pce$horizon], pred_horizon[,1], col = "tomato")
lines(pce$date[pce$horizon], pred_horizon[,2], col = "dodgerblue", lty = 2)
lines(pce$date[pce$horizon], pred_horizon[,3], col = "dodgerblue", lty = 2)
What if there is a policy introduced?
Drastic, sharp mean changes (e.g., policy changes):
are difficult to anticipate
can “ruin” your forecasting model
Below, we have data on daily counts of both crime complaints and stop & frisk incidents.
crime2 <- tmp   # weekly crime-complaint counts, presumably aggregated in an earlier (unshown) chunk
# Aggregate the daily stop & frisk counts to weekly totals
aggregate(list(freq = snf$freq),
list(date = floor_date(snf$date, "week")),
sum) -> tmp
lim <- tmp$date < ymd("2012-05-01")
plot(crime2$date, crime2$freq,
xlab = "Week", ylab = "Frequency",
type = "l",
col = scales::alpha("black", .4),
ylim = c(min(tmp$freq[lim]), max(tmp$freq[lim])))
lines(tmp$date[lim], tmp$freq[lim],
col = "dodgerblue", lwd = 2)
abline(v = ymd("2008-01-01") + years(0:8),
lty = 2, col = scales::alpha("black", .4))
plot(crime2$date, crime2$freq,
xlab = "Week", ylab = "Frequency",
type = "l",
col = scales::alpha("black", .4),
ylim = c(min(tmp$freq), max(tmp$freq)))
lines(tmp$date[lim], tmp$freq[lim],
col = "dodgerblue", lwd = 2)
abline(v = ymd("2008-01-01") + years(0:8),
lty = 2, col = scales::alpha("black", .4))
lines(tmp$date[!lim], tmp$freq[!lim],
col = "tomato", lwd = 2)lim_time <- crime2$date %in% ymd("2010-01-01"):ymd("2015-01-01")
plot(crime2$date[lim_time], crime2$freq[lim_time],
xlab = "Week", ylab = "Frequency",
type = "l",
col = scales::alpha("black", .4),
ylim = c(min(tmp$freq), max(tmp$freq)))
lines(tmp$date[lim & lim_time], tmp$freq[lim & lim_time],
col = "dodgerblue", lwd = 2)
# abline(v = ymd("2008-01-01") + years(0:8),
# lty = 2, col = scales::alpha("black", .4))
lines(tmp$date[!lim & lim_time], tmp$freq[!lim & lim_time],
col = "tomato", lwd = 2)
abline(v = ymd(c("2012-05-12", "2013-03-18", "2013-08-13")),
col = "mediumseagreen")\[E[y_{t+h}|\Omega_t] = \beta_0 + \beta_1*I(t > t^{*}) \]
How do we interpret \(\beta_1\)?
reg1 <- lm(freq ~ 1, data = tmp)
tmp$post1 <- ifelse(tmp$date > ymd("2012-05-12"), 1, 0)
tmp$post2 <- ifelse(tmp$date > ymd("2013-03-18"), 1, 0)
tmp$post3 <- ifelse(tmp$date > ymd("2013-08-12"), 1, 0)
reg2 <- lm(freq ~ post1 + post2 + post3, data = tmp)
library("stargazer")
stargazer(reg1, reg2, type = "html",
omit.stat = c("ser", "f"),
column.labels = "Stop & Frisk Count")| Dependent variable: | ||
| freq | ||
| Stop | Frisk Count | |
| (1) | (2) | |
| post1 | -3,439.785*** | |
| (272.580) | ||
| post2 | -3,843.540*** | |
| (448.532) | ||
| post3 | -3,107.412*** | |
| (391.836) | ||
| Constant | 6,840.966*** | 10,933.900*** |
| (221.941) | (101.428) | |
| Observations | 522 | 522 |
| R2 | 0.000 | 0.889 |
| Adjusted R2 | 0.000 | 0.888 |
| Note: | p<0.1; p<0.05; p<0.01 | |
plot(tmp$date[lim & lim_time], tmp$freq[lim & lim_time],
xlab = "Week", ylab = "Frequency",
type = "l", ylim = c(min(tmp$freq), max(tmp$freq)),
xlim = c(min(tmp$date[lim_time]),
max(tmp$date[lim_time])),
col = "dodgerblue", lwd = 2)
lines(tmp$date[!lim & lim_time], tmp$freq[!lim & lim_time],
col = "dodgerblue", lwd = 2)
abline(v = ymd(c("2012-05-12", "2013-03-18", "2013-08-13")),
col = "black")
lines(tmp$date, predict(reg1, tmp), lty = 2,
lwd = 2, col = "tomato")
lines(tmp$date, predict(reg2, tmp), lty = 3,
col = "mediumseagreen", lwd = 2)
legend("topright", bty = "n", lty = c(2:3),
legend = c("Reg 1", "Reg 2"), cex = 1.3,
col = c("tomato", "mediumseagreen"))Mean shifts are difficult to anticipate ex-ante.
How should we select these breaks?
rsqr <- c()
# Loop over candidate break dates (trimming the sample ends) and store each R-squared
for(i in 10:(nrow(tmp)-11)){
tmp$post <- ifelse(tmp$date > (min(tmp$date) + weeks(i)), 1, 0)
regx <- lm(freq ~ post, tmp)
rsqr <- c(rsqr, summary(regx)$r.squared)
}
rsqr_plot <- c(rep(NA, 10), rsqr, rep(NA, 10))
max_date <- tmp$date[which(rsqr_plot == max(rsqr_plot, na.rm = TRUE))]
plot(tmp$date, rsqr_plot, ylab = "R-Squared",
xlab = c("Break Point", paste0("Date Selected: ", max_date)),
type = "l", lwd = 2,
ylim = c(0, 1))
lines(tmp$date, tmp$freq/max(tmp$freq),
col = scales::alpha("black", .4),
type = "l")
abline(v = max_date, col = "red")
What if the mean is gradually changing over time?
Trend Model: \(y_{t} = f(\text{time}_t)\)
Different Types of Trend Models:
Linear:
\[y_t = \beta_0 + \beta_1 \text{Time}_t\]
Quadratic:
\[y_t = \beta_0 + \beta_1 \text{Time}_t + \beta_2 \text{Time}_t^2\]
Exponential:
\[y_t = e^{\beta_0 + \beta_1 \text{Time}_t} \implies \log(y_t) = \beta_0 + \beta_1 \text{Time}_t\]
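A minimal sketch of fitting each trend type in R, using a simulated series y and a time index (both are placeholders, not the data used below):
set.seed(1)
time <- 1:120
y <- exp(0.5 + 0.02 * time + rnorm(120, sd = 0.1))   # simulated, exponentially trending series
trend_lin  <- lm(y ~ time)                 # linear trend
trend_quad <- lm(y ~ time + I(time^2))     # quadratic trend
trend_exp  <- lm(log(y) ~ time)            # exponential trend, estimated in logs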
labor <- readRDS("../data/labor_particip.RDS")
lim <- labor$date < ymd("1992-01-01")
par(mfrow = c(1,2))
plot(labor$date[lim], labor$men[lim], type = "l",
xlab = "", ylab = "Participation Rate",
main = "Men")
plot(labor$date[lim], labor$women[lim], type = "l",
xlab = "", ylab = "Participation Rate",
main = "Women")par(mfrow = c(1,2))
sandp <- readRDS("../data/sandp.RDS")
lim <- sandp$date < ymd("1992-01-01")
plot(sandp$date[lim], sandp$volume[lim], type = "l",
xlab = "", ylab = "Sales")
plot(sandp$date[lim], log(sandp$volume[lim]), type = "l",
xlab = "", ylab = "Sales",
ylim = c(min(log(sandp$volume[lim])),
max(asinh(sandp$volume[lim]))),
col = scales::alpha("tomato", .6))
lines(sandp$date[lim], asinh(sandp$volume[lim]), type = "l",
xlab = "", ylab = "Sales",
col = scales::alpha("dodgerblue", .6))
legend("topleft", legend = c("log()", "asinh()"),
col = c("tomato", "dodgerblue"),
lty = 1, bty = "n")| Dependent variable: | ||
| women | ||
| (1) | (2) | |
| date | 0.002*** | |
| (0.00001) | ||
| time | 0.051*** | |
| (0.0004) | ||
| Constant | 42.120*** | 28.705*** |
| (0.057) | (0.113) | |
| Observations | 540 | 540 |
| R2 | 0.973 | 0.973 |
| Adjusted R2 | 0.973 | 0.973 |
| Note: | p<0.1; p<0.05; p<0.01 | |
pred <- predict(reg1, interval = "predict", level = 0.90)
plot(labor$date[lim], labor$women[lim], type = "l",
xlab = "", ylab = "Labor Force Participation")
lines(labor$date[lim], reg1$fitted.values,
col = "tomato", lty = 2)
lines(labor$date[lim], pred[,2], col = "dodgerblue")
lines(labor$date[lim], pred[,3], col = "dodgerblue")new_lim <- labor$date %in% ymd("1980-01-01"): ymd("2009-12-01")
new_pred <- predict(reg1, labor[!lim,], interval = "predict", level = 0.90)
plot(labor$date[new_lim & lim],
labor$women[new_lim & lim],
type = "l",
xlab = "", ylab = "Labor Force Participation",
xlim = c(min(labor$date[new_lim]), max(labor$date[new_lim])),
ylim = c(min(labor$women[new_lim], new_pred[,2]),
max(labor$women[new_lim], new_pred[,3])))
lines(labor$date[lim], reg1$fitted.values,
col = "tomato", lty = 2)
lines(labor$date[lim], pred[,2], col = "dodgerblue")
lines(labor$date[lim], pred[,3], col = "dodgerblue")
lines(labor$date[!lim], new_pred[,1], lty = 2)
lines(labor$date[!lim], new_pred[,2], col = "mediumseagreen")
lines(labor$date[!lim], new_pred[,3], col = "mediumseagreen")plot(labor$date[new_lim & lim],
labor$women[new_lim & lim],
type = "l",
xlab = "", ylab = "Labor Force Participation",
xlim = c(min(labor$date[new_lim]), max(labor$date[new_lim])),
ylim = c(min(labor$women[new_lim], new_pred[,2]),
max(labor$women[new_lim], new_pred[,3])))
lines(labor$date[lim], reg1$fitted.values,
col = "tomato", lty = 2)
lines(labor$date[lim], pred[,2], col = "dodgerblue")
lines(labor$date[lim], pred[,3], col = "dodgerblue")
lines(labor$date[!lim], new_pred[,1], lty = 2)
lines(labor$date[!lim], new_pred[,2], col = "mediumseagreen")
lines(labor$date[!lim], new_pred[,3], col = "mediumseagreen")
lines(labor$date[!lim], labor$women[!lim], lwd = 2)
lim <- sales$date < ymd("1995-01-01")
sales$time <- interval(min(sales$date), sales$date) %/% months(1)
reg1 <- lm(sales ~ time, sales[lim,])
reg2 <- lm(sales ~ time + I(time^2), sales[lim,])
plot(sales$time[lim], sales$sales[lim], type = "l",
xlab = "", ylab = "Sales")
lines(sales$time[lim], reg1$fitted.values, col = "tomato")
lines(sales$time[lim], reg2$fitted.values, col = "dodgerblue")pred <- predict(reg2, sales[!lim,])
plot(sales$time[lim], sales$sales[lim], type = "l",
xlab = "", ylab = "Sales",
xlim = c(min(sales$time), max(sales$time)),
ylim = c(min(sales$sales, pred),
max(sales$sales, pred)))
lines(sales$time[lim], reg2$fitted.values, col = "dodgerblue")
lines(sales$time[!lim], pred, col = "dodgerblue", lty = 2)plot(sales$time[lim], sales$sales[lim], type = "l",
xlab = "", ylab = "Sales",
xlim = c(min(sales$time), max(sales$time)),
ylim = c(min(sales$sales, pred),
max(sales$sales, pred)))
lines(sales$time[lim], reg2$fitted.values, col = "dodgerblue")
lines(sales$time[!lim], sales$sales[!lim], lwd = 2)
lines(sales$time[!lim], pred, col = "dodgerblue", lty = 2)lim <- sandp$date < ymd("1990-01-01")
sandp$time <- interval(min(sandp$date), sandp$date) %/% weeks(1)
reg1 <- lm(log(volume) ~ time, sandp[lim,])
reg2 <- lm(asinh(volume) ~ time, sandp[lim,])
par(mfrow = c(1, 2))
plot(sandp$time[lim], sandp$volume[lim], type = "l",
xlab = "", ylab = "S&P Transactions")
lines(sandp$time[lim], exp(reg1$fitted.values),
col = scales::alpha("tomato", .4), lwd = 4)
lines(sandp$time[lim], sinh(reg2$fitted.values),
col = scales::alpha("dodgerblue", .4), lwd = 4)
plot(sandp$time[lim], asinh(sandp$volume[lim]), type = "l",
xlab = "", ylab = "Transformed S&P Transactions",
col = scales::alpha("dodgerblue", .6),
ylim = c(min(log(sandp$volume[lim])),
max(asinh(sandp$volume[lim]))))
lines(sandp$time[lim], log(sandp$volume[lim]),
col = scales::alpha("tomato", .6))
lines(sandp$time[lim], reg1$fitted.values,
col = scales::alpha("tomato", .4), lwd = 4)
lines(sandp$time[lim], reg2$fitted.values,
col = scales::alpha("dodgerblue", .4), lwd = 4)pred <- predict(reg2, sandp[!lim,], interval = "predict", .9)
pred <- pred$fit
plot(sandp$time[lim], sandp$volume[lim], type = "l",
xlab = "", ylab = "S&P Transactions",
xlim = c(min(sandp$time), max(sandp$time)),
ylim = c(0,
max(sinh(pred[,3]))))
lines(sandp$time[lim], sinh(reg2$fitted.values),
col = scales::alpha("dodgerblue", .6), lwd = 4)
lines(sandp$time[!lim], sinh(pred[,1]),
col = scales::alpha("dodgerblue", .6),
lwd = 4, lty = 2)
lines(sandp$time[!lim], sinh(pred[,2]),
col = scales::alpha("dodgerblue", .6),
lwd = 4, lty = 1)
lines(sandp$time[!lim], sinh(pred[,3]),
col = scales::alpha("dodgerblue", .6),
lwd = 4, lty = 1)
plot(sandp$time[lim], sandp$volume[lim], type = "l",
xlab = "", ylab = "S&P Transactions",
xlim = c(min(sandp$time), max(sandp$time)),
ylim = c(0,
max(sandp$volume, pred[,3])))
lines(sandp$time[!lim], sandp$volume[!lim], lwd = 1)
lines(sandp$time[lim], sinh(reg2$fitted.values),
col = scales::alpha("dodgerblue", .6), lwd = 4)
lines(sandp$time[!lim], sinh(pred[,1]),
col = scales::alpha("dodgerblue", .6),
lwd = 4, lty = 2)
lines(sandp$time[!lim], sinh(pred[,2]),
col = scales::alpha("dodgerblue", .6),
lwd = 4, lty = 1)
lines(sandp$time[!lim], sinh(pred[,3]),
col = scales::alpha("dodgerblue", .6),
lwd = 4, lty = 1)
Trend models will understate uncertainty.
Uncertainty increases with the forecast horizon.
Forecasters usually predict growth rates and then convert back to levels (see the conversion below).
There is no economic theory behind trends!
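For example, if \(\hat{g}_{T+j}\) denotes the forecast growth rate in period \(T+j\), the level forecast is recovered from the last observed level \(y_T\) by compounding (a standard identity, stated here for completeness):
\[\hat{y}_{T+h} = y_T \prod_{j=1}^{h}\left(1 + \hat{g}_{T+j}\right), \qquad \text{or} \qquad \hat{y}_{T+h} = y_T \, e^{\sum_{j=1}^{h}\hat{g}_{T+j}} \ \text{with log growth rates.}\]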
Intercept shift: \[y_t = \beta_0 + \beta_1 \text{Time}_t + \beta_2 I(\text{Time}_t > t^*)\]
Slope shift: \[y_t = \beta_0 + \beta_1 \text{Time}_t + \beta_2 \text{Time}_t \, I(\text{Time}_t > t^*)\]
Both: \[y_t = \beta_0 + \beta_1 \text{Time}_t + \beta_2 I(\text{Time}_t > t^*) \\ + \beta_3 \text{Time}_t \, I(\text{Time}_t > t^*)\]
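A sketch of how these break-in-trend models could be estimated in R; y, time, and the break point t_star below are simulated placeholders, not objects from the slides:
set.seed(1)
time <- 1:200
t_star <- 120                                    # assumed break point
y <- 2 + 0.05 * time + 0.1 * (time - t_star) * (time > t_star) + rnorm(200, sd = 0.5)
post <- as.numeric(time > t_star)                # I(Time > t*)
break_intercept <- lm(y ~ time + post)                    # level shift only
break_slope     <- lm(y ~ time + I(time * post))          # slope change only
break_both      <- lm(y ~ time + post + I(time * post))   # level and slope change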
MAE - Mean Absolute Error: \(\frac{1}{T}\sum_{t = 1}^{T}|e_t|\)
RMSE - Root Mean Square Error: \(\sqrt{\frac{1}{T}\sum_{t = 1}^{T} e_t^2}\)
MAPE - Mean Absolute Percent Error: \(\frac{100}{T}\sum_{t = 1}^{T}\left|\frac{e_t}{y_t}\right|\)
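A minimal sketch computing the three measures from vectors of actuals and forecasts (both simulated here for illustration):
set.seed(1)
actual   <- rnorm(50, mean = 100, sd = 5)   # simulated actual values
forecast <- actual + rnorm(50, sd = 3)      # simulated forecasts
e <- actual - forecast                      # forecast errors
c(MAE  = mean(abs(e)),
  RMSE = sqrt(mean(e^2)),
  MAPE = 100 * mean(abs(e / actual)))       # in percent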
* vs :
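Assuming this refers to R's formula operators: * expands to the main effects plus their interaction, while : gives the interaction term only. For example, with the stop & frisk dummies from above:
# freq ~ post1 * post2 is equivalent to freq ~ post1 + post2 + post1:post2
lm(freq ~ post1 * post2, data = tmp)   # main effects plus interaction
lm(freq ~ post1 : post2, data = tmp)   # interaction term only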
Functions
ECON 707/807: Econometrics II